Abstract. Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of $D$ string documents of total length $n$. Our
task is to index $\mathcal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of
length $p$ can be retrieved efficiently. We propose an index of size $|CSA| + n \log D (2 + o(1))$ bits
and $O(t_s(p) + k \log\log n + \mathrm{poly}\log\log n)$ query time for the basic relevance metric term-frequency,
where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$ with $O(t_s(p))$ time for
searching a pattern of length $p$. We further reduce the space to $|CSA| + n \log D (1 + o(1))$ bits;
however, the query time becomes $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \mathrm{poly}\log\log n)$, where $\sigma$ is the
alphabet size and $\epsilon > 0$ is any constant.



1 Introduction and Related Work 

Document retrieval is a special type of pattern matching that is closely related to informa- 
tion retrieval and web searching. In this problem, the data consists of a collection of text 
documents, and given a query pattern P, we are required to report all the documents in 
which this pattern occurs (not all the occurrences). In addition, the notion of relevance is 
commonly applied to rank all the documents that satisfy the query, and only those documents 
with the highest relevance are returned. Such a concept of relevance has been central in the 
effectiveness and usability of present day search engines like Google, Bing, Yahoo, or Ask. 
When relevance is considered, the query has an additional input parameter k, and the task is 
to report the k documents with the highest relevance to the query pattern (in the decreasing 
order of relevance), instead of finding all the documents that contain the query pattern (as 
there may be too many). More formally, let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ denote a given set of $D$ string
documents to be indexed, whose total length is $n$, and let $P$ denote a query pattern of length
$p$. Let $occ$ be the number of occurrences of this pattern over the entire collection $\mathcal{D}$, and $ndoc$
be the number of documents out of $D$ in which the pattern $P$ appears. One of the main issues
is the fact that $k \ll ndoc \ll occ$. Thus, it is important to design indexes which do not have
to go through all the occurrences or even all the documents in order to answer a query. 

The research in string document retrieval was introduced by Matias et al. [19], and
Muthukrishnan [21] formalized it with the introduction of relevance metrics like term-frequency
(tf) and min-dist (see footnote 3) and proposed indexes with efficient query performance. Since then, this
has been an active research area [28,29]. The top-k document retrieval problem was introduced
in [12], where an $O(n \log n)$-word index is proposed with $O(p + k + \log n \log\log n)$ query
time for the case when the relevance metric is term-frequency. A recent flurry of activities
in this area [25,15,8,2,4,26,23,17,13,24] came with Hon et al.'s work [14], where they gave a
linear-space index with $O(p + k \log k)$ query time, which works for a wide class of relevance
metrics. The recent structure by Navarro and Nekrich [22] achieves optimal $O(p + k)$ query
time using $O(n(\log\sigma + \log D + \log\log n))$ bits, which improves the results in [14] in both space



3 tf(P,d) is the number of occurrences of P in d and min-dist(P, d) is the minimum distance between two 
occurrences of P in d 



and time. If the relevance metric is term-frequency, their index space can be further improved
to $O(n(\log\sigma + \log D))$ bits. All these interesting results have contributed towards the goal
of achieving an optimal query time index. However, the space is far from optimal; moreover,
the constants hidden in the space bound can restrict the use of these indexes in practice. On
the other side, the succinct index proposed by Hon et al. [14] takes about $O(\log^4 n)$ time to
report each document, which is likely to be impractical. This time bound has been further
improved by [2,8], but still polylog($n$) time is required per reported document. Another line of
work is to derive indexes using about $n \log D$ bits of additional space, and the best known index
takes a per-document report time of $O(\log k \log^{1+\epsilon} n)$ [2]. Efficient practical indexes are also
known [3], but their query algorithms are heuristics with no worst-case bound. In this paper,
we introduce two space-efficient indexes with per-document report time poly-log-logarithmic
in $n$. The main results are summarized as follows.

Theorem 1 There exists an index of size $|CSA| + n \log D (2 + o(1))$ bits with a query time
of $O(t_s(p) + k \log\log n + \mathrm{poly}\log\log n)$ for retrieving top-k documents with the highest term
frequencies, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$ with $O(t_s(p))$
time for searching a pattern of length $p$.

Theorem 2 There exists an index of size $|CSA| + n \log D (1 + o(1))$ bits with a query time of
$O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \mathrm{poly}\log\log n)$ for retrieving top-k documents with the highest
term frequencies, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$ with
$O(t_s(p))$ time for searching a pattern of length $p$, $\sigma$ is the alphabet size and $\epsilon > 0$ is a constant.

Table 1 gives a summary of the major results in the top-k frequent document retrieval
problem. The time complexities are simplified by assuming that we are using the full-text
index proposed by Belazzougui and Navarro, of size $|CSA| = nH_h + O(n) + o(n \log\sigma)$ bits
and $t_s(p) = O(p)$, where $H_h$ is the $h$th-order empirical entropy of $\mathcal{D}$ [1]. We also assume
$D < n^{\epsilon}$ for some $\epsilon < 1$, where $\epsilon > 0$ is any constant.



Table 1. Indexes for Top-k Frequent Document Retrieval

Source   Index Space (in bits)                   Time per reported document
[12]     $O(n \log n + n \log^{?} D)$            $O(1)$
[14]     $O(n \log n)$                           $O(\log k)$
[3]      $|CSA| + n \log D (1 + o(1))$           Unbounded
[14]     $2|CSA| + o(n)$                         $O(\log^{4+\epsilon} n)$
[2]      $2|CSA| + o(n)$                         $O(\log k \log^{2+\epsilon} n)$
[?]      $|CSA| + O(?)$                          $O(\log^{?+\epsilon} n)$
[?]      $|CSA| + O(?)$                          $O(\log k \log^{2+\epsilon} n)$
[?]      $|CSA| + O(n \log\log\log D)$           $O(\log k \log^{2+\epsilon} n)$
[22]     $O(n \log\sigma + n \log D)$            $O(1)$
[?]      $|CSA| + n \log D + o(n)$               $O(\log^{?} n)$
[2]      $|CSA| + n \log D + o(n)$               $O(\log k \log^{1+\epsilon} n)$
Ours     $|CSA| + 2n \log D (1 + o(1))$          $O(\log\log n)$
Ours     $|CSA| + n \log D (1 + o(1))$           $O((\log\sigma \log\log n)^{1+\epsilon})$



2 Preliminaries 



2.1 Top-k Using Range Maximum/Minimum Queries

One of the main tools in top-k retrieval is the range maximum/minimum query (RMQ)
structure [6]. We summarize the results in the following lemmas (we defer the proofs to
Appendices A and B, respectively).

Lemma 1. Let $A[1..n]$ be an array of $n$ numbers. We can preprocess $A$ in linear time
and associate $A$ with a $(2n + o(n))$-bit RMQ data structure such that, given a set of $t$ non-overlapping
ranges $[L_1, R_1], [L_2, R_2], \ldots, [L_t, R_t]$, we can find the largest (or smallest) $k$ numbers
in $A[L_1..R_1] \cup A[L_2..R_2] \cup \cdots \cup A[L_t..R_t]$ in unsorted order in $O(t + k)$ time.

Lemma 2. Let $A[1..n]$ be an array of $n$ integers taken from the set $[1, \pi]$, and each number
$A[i]$ is associated with a score (which may be stored separately and can be computed in $t_{score}$
time). Then the array $A$ can be maintained in $O(n \log\pi)$ bits, such that given two ranges
$[x', x'']$, $[y', y'']$, and a parameter $k$, we can search among those entries $A[i]$ with $x' \le i \le x''$
and $y' \le A[i] \le y''$, and report the $k$ highest scoring entries in unsorted order in
$O((\log\pi + k)(\log\pi + t_{score}))$ time.

3 A Brief Review of Hon et al.'s Index 

In this section we give a brief description of Hon et al.'s index [13]. Let $T = d_1 \# d_2 \# \cdots \# d_D \#$
be a text obtained by concatenating all the documents in $\mathcal{D}$, separated by a special symbol $\#$
not appearing elsewhere inside any of the $d_i$'s. Then the suffix tree [30,20,18] of $T$ is called the
generalized suffix tree GST of $\mathcal{D}$. Any given substring $T[a..b]$ of $T$ (which does not contain
$\#$) is a substring of some document $d_x \in \mathcal{D}$, and the value of $x$ can be computed in $O(1)$
time by maintaining an $(n + D)(1 + o(1))$-bit auxiliary data structure (see footnote 4). Each edge in GST is
labeled by a character string and, for any node $u$, the path label of $u$, denoted by $path(u)$, is
the string formed by concatenating the edge labels from the root to $u$. Note that the path label
of the $i$th leftmost leaf in GST is exactly the $i$th lexicographically smallest suffix of $T$. For
a pattern $P[1..p]$ that appears in $T$, the locus node of $P$ is denoted by $locus(P)$, which is
the unique node closest to the root such that $P$ is a prefix of $path(locus(P))$, and can be
determined in $O(p)$ time. We augment GST with the following structures.

N-structure: An N-structure entry is a triplet (doc, score, parent) and is associated with
some node in GST. If $u$ is a leaf node and $path(u)$ is a suffix of document $d$, then an N-structure
entry with doc $= d$ is stored at $u$. However, if $u$ is an internal node, multiple N-structure
entries may be stored at $u$ as follows: an entry with doc $= d$ is stored if and only if at least
two children of $u$ contain (a suffix of) document $d$ in their subtrees. The score field in an
N-structure entry for a document $d$ associated with a node $u$ is $score(path(u), d)$: the relevance
score of $d$ with respect to the pattern $path(u)$ (see footnote 5). The parent field stores (the pre-order rank
of) the lowest ancestor of $u$ which has an entry for document $d$ in its N-structure. In case
there is no such ancestor, we assign a dummy node which is regarded as the parent of the
root of GST.

I-structure: An I-structure entry is a triplet (doc, score, origin) and is associated with 
some node in GST. If node u has an N-structure entry for document d and an N-structure 

4 Maintain a bit vector $B[1..(n + D)]$, where $B[i] = 1$ if and only if $T[i] = \#$; then $x = rank_B(a) + 1$, which can
be computed in $O(1)$ time using [27].

5 The score is dependent only on $d$ and the set of occurrences of $path(u)$ in $d$.



entry of another node $v$ is given by $(d, score(path(v), d), u)$, then $u$ will have an I-structure
entry $(d, score(path(v), d), v)$. An internal node may be associated with multiple I-structure
entries, and these entries are maintained in an array, sorted by the origin field. In addition,
a range maximum query (RMQ) structure is maintained over the array based on the score
field.
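
The document lookup described earlier in this section (footnote 4) can be illustrated with a small sketch. It assumes 1-based text positions and that every document is followed by the separator; a plain prefix-sum array stands in for the succinct constant-time rank structure of [27]. The function names are hypothetical and not part of the original index.

```python
from itertools import accumulate

def build_text(docs, sep="#"):
    """Concatenate the documents, each followed by the separator symbol."""
    return "".join(d + sep for d in docs)

def build_rank(text, sep="#"):
    """rank[i] = number of separators among the first i characters of text.
    Assumption: prefix sums replace the succinct rank structure."""
    return [0] + list(accumulate(1 if c == sep else 0 for c in text))

def doc_of(rank, a):
    """Return x such that the 1-based position a of T lies inside d_x,
    following x = rank_B(a) + 1."""
    return rank[a] + 1

T = build_text(["ab", "c", "de"])   # "ab#c#de#"
R = build_rank(T)
```

Here `doc_of(R, 4)` looks up the position of the character `c`, which belongs to the second document.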

3.1 Query Answering 

To answer a top-k query, we first search for the query pattern $P$ in GST and find its locus
node $locus(P)$. We also find the rightmost leaf $locus_R(P)$ in the subtree of $locus(P)$. Now,
our task is to find, among the documents whose suffixes appear in the subtree of $locus(P)$,
the $k$ documents with the highest number of occurrences of $P$. Hon et al. showed that this can be done
by checking only the I-structure entries associated with the proper ancestors of $locus(P)$,
and then retrieving those $k$ entries which have the highest score values and whose origin is
in the subtree of $locus(P)$ (inclusively). The number of ancestors of $locus(P)$ is bounded by $p$,
and since the I-structure entries are sorted according to the origin values, the entries to be
checked occupy a contiguous region in the sorted array. The boundaries of the contiguous
region can be obtained by performing a binary search based on (the pre-order ranks of)
$locus(P)$ and $locus_R(P)$. Once we get the boundaries of the contiguous region in each proper
ancestor of $locus(P)$, we can apply RMQ queries repeatedly over score and retrieve the top-k
scoring documents in sorted order in $O(p \log n + k \log k)$ time. The binary search step can be
made faster by maintaining a predecessor structure [31], and the resulting time becomes
$O(p \log\log n + k \log k)$. This time has been further improved to $O(p + k \log k)$ by introducing
two additional fields $\delta_f$ and $\delta_i$ in each N-structure entry. The number of N-structure entries
(hence I-structure entries) is $\le 2n$. Therefore the index space is $O(n \log n)$ bits.

4 Our Linear-Space Index 

In this section, we derive a modified version of Hon et al.'s linear index without the $\delta$ fields
that still achieves the $O(p)$ term in the query time. The main technique is to introduce a novel
criterion that categorizes the I-structure entries as near and far. The far entries associated
with certain nodes can be maintained together as a combined I-structure, which reduces the
number of I-structure boundaries to be searched to $O(p/\pi + \pi)$, where $\pi$ is a sampling factor.
By choosing $\pi = \log\log n$, we can use a predecessor search structure (instead of the $\delta$ fields) and
compute the I-structure boundaries in $O((p/\pi + \pi) \log\log n) = O(p + \log^2\log n)$ time.
We have the following result.

Theorem 3 There exists an index of size $O(n \log n)$ bits for top-k document retrieval with
$O(p + \log^2\log n + k \log\log\log n + k \log k)$ query time.

Proof. Firstly, we mark all nodes in GST whose node-depths are multiples of $\pi$ (the node-depth of the
root is 0). Thus, any unmarked node is at most $\pi$ nodes away from its lowest marked ancestor.
Also, the number of marked ancestors of any node is $\lceil (\text{number of ancestors}) / \pi \rceil$. For any node
$w$ in GST, we define a value $\zeta(w) < \pi$, where $\zeta(w) = 0$ if $w$ is marked; otherwise it is the number
of nodes in the path from $w$ (exclusively) to its lowest marked ancestor (inclusively). In each
I-structure entry $(d, s, v)$ associated with a node $w$, we maintain a fourth component $\zeta(w)$.
Next, we categorize the I-structure entries as far and near as follows:

An I-structure entry associated with a node w, with origin = v, is near if there exists 
no marked node in the path from v (inclusively) to w (exclusively), else it is far. 
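
The marking rule and the near/far criterion above can be sketched in a few lines. This is an illustration only: the tree is given as a hypothetical parent array in which every parent has a smaller index than its children (as with pre-order numbering), and `is_near` walks the path explicitly instead of using any succinct structure.

```python
def node_depths(parent):
    """parent[v] is the parent of node v, and -1 for the root.
    Assumption: parent[v] < v for every non-root node."""
    depth = [0] * len(parent)
    for v in range(len(parent)):
        if parent[v] >= 0:
            depth[v] = depth[parent[v]] + 1
    return depth

def zeta(depth, pi):
    """zeta(w): 0 for marked nodes (depth a multiple of pi), otherwise the
    number of nodes from w (exclusive) up to its lowest marked ancestor."""
    return [d % pi for d in depth]

def is_near(parent, depth, pi, w, v):
    """Entry stored at w with origin v (w must be a proper ancestor of v).
    Near means: no marked node on the path from v (inclusive) to w (exclusive)."""
    u = v
    while u != w:
        if depth[u] % pi == 0:
            return False
        u = parent[u]
    return True

# A path-shaped tree 0 - 1 - 2 - 3 - 4 - 5 with sampling factor pi = 2.
parent = [-1, 0, 1, 2, 3, 4]
depth = node_depths(parent)
```

With $\pi = 2$ the marked nodes of this toy tree are 0, 2 and 4, so an entry stored at node 1 with origin 3 is far because the marked node 2 lies between them.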



We restructure the entries such that all far entries are maintained in a combined I-structure
associated with some marked node as follows: if $(d, s, v, \zeta(w))$ is a far entry in the I-structure
$I_w$ associated with node $w$, then we remove this entry from $I_w$ and move it to a combined
I-structure associated with the node $u$, where $u = w$ if $w$ is marked, else $u$ is the lowest marked
ancestor of $w$ (i.e., $u$ is $\zeta(w)$ nodes above $w$). All the entries in the combined I-structure
are maintained in the sorted order of origin values. A predecessor search structure over the
origin field and an RMQ structure over the score field are maintained over all I-structures. Next,
to understand how to answer a query with our index, we introduce the following auxiliary
lemma.

Lemma 3. The top-k documents corresponding to a pattern $P$ can be obtained by checking
the following I-structure entries (with origins coming from the subtree of $locus(P)$):

(i) near entries in the regular I-structures associated with the nodes in the path from $locus(P)$
(exclusively) to its lowest marked ancestor $u$ (inclusively), and there are at most $\pi$ such nodes;

(ii) far entries with $\zeta < \zeta(locus(P))$ in the combined I-structure of $u$, and

(iii) far entries in the combined I-structures associated with the marked proper ancestors of $u$
(at most $p/\pi$ of them).

Proof. In the original index by Hon et al., we need to check the I-structure entries in all 
ancestors of locus(P). We may categorize them as follows: 

(a) near entries associated with a node in the subtree of u (inclusively); 

(b) far entries associated with a node in the subtree of u (inclusively); 

(c) far entries associated with an ancestor node of u; 

(d) near entries associated with an ancestor node of u. 

All entries in (a) belong to category (i) in the lemma. The valid entries in (b) belong to
category (ii), where the inequality $\zeta < \zeta(locus(P))$ ensures that all entries in category
(ii) were originally from an ancestor of $locus(P)$. All those entries in (c) which may be a
possible candidate for the top-k documents belong to category (iii) in the lemma. None of
the entries in (d) can be a valid output, as the origins of those entries are not in the
subtree of $u$ (from the definition of a near entry), hence not in the subtree of $locus(P)$. On
the other hand, since we always check for the entries with origins coming from the subtree
of $locus(P)$, these entries must be a subset of those checked in the original index by Hon et
al. In conclusion, the entries checked in both indexes are exactly the same, and the lemma
follows. □



Based on the above lemma, we may compute $k$ candidate answers from each category, and
the actual top-k answers can be computed by comparing the scores of these $3k$ documents. In
category (i) we have at most $\pi$ boundaries to be searched, which takes $O(\pi \log\log n)$ time,
and we then retrieve the $k$ candidate answers in unsorted order in $O(\pi + k)$ time using
Lemma 1. Similarly, in category (iii), the number of I-structure boundaries to be searched is
$p/\pi$, and it takes $O((p/\pi) \log\log n + k)$ time in total. However, for category (ii), we have an
additional constraint on the $\zeta$ value of the entries. To facilitate the process, the $\zeta$ components are
maintained by the data structure in Lemma 2 in $O(n \log\pi)$ bits, so that the desired answers
can be reported in $O((\log\pi + k)(\log\pi + O(1)))$ time. The $O(k \log k)$ term is for sorting the answers.
The time for the initial pattern search is $O(p)$. Putting it all together with $\pi = \log\log n$, we obtain
Theorem 3. □



5 Space-Efficient Encoding of Our Index 



In this section, we derive a space-efficient index for the relevance metric term-frequency. The
major contribution is that, instead of using $O(\log n)$ bits for an I-structure entry, we design
some novel encodings so that each entry requires only $\log D + \log\pi + O(1)$ bits. The GST
is replaced by a compressed full-text index CSA of size $|CSA|$ bits [11,5,10,1] along
with the tree encoding of GST in $4n + o(n)$ bits (see footnote 6). Thus $locus(P)$ can be computed in
$O(p)$ time by taking the LCA (lowest common ancestor) of the leftmost and rightmost leaves in the
suffix range of $P$.

A core component of our index is the document array $D_A$, where $D_A[i]$ stores the id of the
document to which the $i$th smallest suffix in GST belongs. $D_A$ can be maintained in
$n \log D + O(n \log D / \log\log D)$ bits and can answer the following queries in $O(\log\log D)$ time [9]: (i)
access(i): returns $D_A[i]$; (ii) rank(d, i): returns the number of occurrences of document $d$ in
$D_A[1..i]$; (iii) select(d, j): returns $-1$ if $j > |d|$, else $i$ where $D_A[i] = d$ and rank(d, i) $= j$. Now
we show how to use $D_A$ for efficient encoding and decoding of different components in an
I-structure entry.

Term-frequency Encoding: Given an I-structure entry with origin $= v$ and doc $= d$, the
corresponding term-frequency score is exactly the number of occurrences of $d$ in $D_A[i..j]$,
where $i$ and $j$ are the leftmost leaf and the rightmost leaf of $v$, respectively. Thus, given the
values $v$ and $d$, we can find $i$ and $j$ in constant time based on the tree encoding of the GST,
and then compute the term-frequency in $O(\log\log D)$ time based on two rank queries on $D_A$.
Thus, we discard the score field completely for all I-structure entries, keeping only
the RMQ structure over it.
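
The reduction of term-frequency to two rank queries can be checked on a toy collection. The sketch below is an assumption-laden illustration: it builds a naive suffix array by sorting suffixes, keeps the suffixes that start with the separator (so the array has $n + D$ entries), uses 0-based indices, and implements rank by a linear scan instead of the $O(\log\log D)$ structure of [9]. All function names are hypothetical.

```python
def build_document_array(docs, sep="#"):
    """Return (text, suffix array, document array) for the concatenation."""
    text = "".join(d + sep for d in docs)
    owner, x = [], 1
    for c in text:
        owner.append(x)          # document id (1-based) owning this position
        if c == sep:
            x += 1
    sa = sorted(range(len(text)), key=lambda s: text[s:])   # naive construction
    return text, sa, [owner[s] for s in sa]

def suffix_range(text, sa, pattern):
    """Inclusive 0-based range of suffixes prefixed by pattern, or None."""
    hits = [r for r, s in enumerate(sa) if text.startswith(pattern, s)]
    return (hits[0], hits[-1]) if hits else None

def rank(doc_array, d, i):
    """Occurrences of d in doc_array[0:i]; a linear scan stands in for [9]."""
    return doc_array[:i].count(d)

def term_frequency(doc_array, d, i, j):
    """tf of document d inside the suffix range [i, j], via two rank queries."""
    return rank(doc_array, d, j + 1) - rank(doc_array, d, i)

text, sa, da = build_document_array(["abab", "ba", "aab"])
i, j = suffix_range(text, sa, "ab")
```

For the pattern `ab` the range covers three suffixes, two from the first document and one from the third, which is what `term_frequency` reports.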

Origin Encoding: Origin encoding is the trickiest part, and is based on the following
observation by Hon et al. [14]: for any document $d$ and for any node $v$ in GST, there is at
most one ancestor of $v$ that contains an I-structure entry with doc $= d$ and origin from a
node in the subtree of $v$ (inclusively). We introduce two separate schemes for encoding origin
fields in near and far entries. This reduces the origin array space from $O(n \log n)$ bits to
$O(n)$ bits, and decoding takes $O(\log\log D)$ time.

Encoding near entries: Let $I_w$ be a regular I-structure (with only near entries) associated
with a node $w$, and let $w_q$ represent the pre-order rank of the $q$th child of $w$. Then, from the
definition of I-structures, for a given document $d$ there exists at most one entry in $I_w$ with
doc $= d$ and origin from the subtree of $w_q$ (inclusively). Thus, for a given document $d$
and an internal node $w$, an entry in $I_w$ can be associated with a unique child node $w_q$ of $w$
(where $w_q$ represents the $q$th child of $w$ from the left, $1 \le q \le degree(w)$, and the pre-order rank
of $w_q$ can be computed in constant time [16]), such that the origin is in the subtree of $w_q$.
Moreover, this origin must be the node closest to the root in the subtree of $w_q$ which has an
N-structure entry for $d$. From the definition of the N-structure, this origin node must be the lowest
common ancestor (LCA) of the leaves corresponding to the first and last suffixes of $d$ in the
subtree of $w_q$, which can be computed using the tree encoding of GST and a constant number
of rank/select operations on $D_A$ in $O(\log\log D)$ time in total. Therefore, by maintaining the
information about $w_q$ (origin-child $= q$) for each I-structure entry, the corresponding origin
value can be decoded in $O(\log\log D)$ time. Thus, the origin array can be replaced completely

6 Any $n$-node ordered tree can be represented in $2n + o(n)$ bits, such that if each node is labeled by its
pre-order rank in the tree, any of the following operations can be supported in constant time [16]: parent(i),
which returns the parent of node $i$; child(i, q), which returns the $q$th child of node $i$; child-rank(i), which
returns the number of siblings to the left of node $i$; lca(i, j), which returns the lowest common ancestor of
two nodes $i$ and $j$; and lmost-leaf(i)/rmost-leaf(i), which return the leftmost/rightmost leaf of node $i$.



by the origin-child array. Recall that each node maintains the I-structure entries in sorted
order of the origins, so that the corresponding origin-child array is monotonically increasing.
In addition, the value of each entry is between 1 and $degree(w)$, so that the array can be
encoded using a bit vector of length $|I_w| + degree(w)$ (see footnote 7). The total size of the bit vectors
associated with all nodes can be bounded by $\sum_{w \in GST} (|I_w| + degree(w)) = O(n)$ bits. The
$O(n \log n)$-bit predecessor search structure over the origin array is replaced by a structure of
$o(n)$ bits of space and $O(\log\log n)$ search time (see footnote 8).
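
The bit-vector encoding of a monotone array (footnote 7) can be reproduced directly. In this sketch a 1 bit advances the current value and a 0 bit emits one element; decoding follows $S[i] = rank_1(select_0(i))$. The linear scans below are an assumption made for readability and stand in for the constant-time rank/select structure of [27]; the function names are hypothetical.

```python
def encode_monotone(seq):
    """Encode a non-decreasing sequence of positive integers as a bit list."""
    bits, cur = [], 0
    for v in seq:
        bits.extend([1] * (v - cur))   # advance the current value up to v
        bits.append(0)                 # emit one element equal to v
        cur = v
    return bits

def decode_at(bits, i):
    """Return S[i] (1-based): the number of 1s before the i-th 0."""
    zeros = ones = 0
    for b in bits:
        if b == 0:
            zeros += 1
            if zeros == i:
                return ones
        else:
            ones += 1
    raise IndexError(i)

S = [1, 3, 3, 3, 4, 4, 5]
B = encode_monotone(S)
```

The encoded vector has one 0 per element and one 1 per unit of the largest value, which is where the $|I_w| + degree(w)$ length bound for an origin-child array comes from.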

Encoding far entries: In order to encode the origin values in far entries, we introduce the
following notions. Let $w^*$ be a marked node; then another node $w^*_q$ is called its $q$th marked
child if $w^*_q$ is the $q$th smallest (in terms of pre-order rank) marked node with $w^*$ as its lowest
marked ancestor. Given the pre-order rank of $w^*$, the pre-order rank of $w^*_q$ can be computed
in constant time by maintaining an additional $O(n)$-bit structure (see footnote 9). Let $I_{w^*}$ represent the
combined I-structure (with only far entries) associated with a marked node $w^*$. The origin
value of any far entry in $I_{w^*}$ is always a node in the subtree of some marked child $w^*_q$ of $w^*$,
and is always unique for a given $q$ and doc $= d$. Thus, by maintaining the information about $w^*_q$
(origin-child* $= q$), we can decode the corresponding origin value for a particular document
$d$; i.e., the origin is the LCA of the leaves corresponding to the first and last suffixes of $d$ in the
subtree of $w^*_q$, which can be computed using the tree encoding of GST and a constant number of
rank/select operations on $D_A$ in $O(\log\log D)$ time in total. Now the origin array can be replaced
by the origin-child* array, which can be encoded in $\sum_{w^* \in GST^*} (|I_{w^*}| + degree(w^*)) = O(n)$ bits
(using a scheme similar to that for encoding the origin-child array for near entries). The predecessor
search structure is replaced by an $o(n)$-bit sampled predecessor search structure.
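
The sampled predecessor search (footnote 8) can be sketched as a two-level lookup: a coarse search over every `step`-th element, then a binary search inside one block. The class below is a hypothetical illustration; it uses `bisect` on the sample array where the paper uses a predecessor structure, and in the index the step would be $\log^2 n$ and the original values would be decoded on demand instead of being stored.

```python
import bisect

class SampledPredecessor:
    def __init__(self, sorted_values, step):
        self.values = sorted_values           # decoded on demand in the index
        self.step = step
        self.samples = sorted_values[::step]  # every step-th element

    def predecessor_index(self, x):
        """Index of the largest element <= x, or -1 if there is none."""
        b = bisect.bisect_right(self.samples, x) - 1   # coarse answer
        if b < 0:
            return -1
        lo = b * self.step
        hi = min(lo + self.step, len(self.values))
        # exact answer: binary search restricted to one block of `step` items
        return bisect.bisect_right(self.values, x, lo, hi) - 1

sp = SampledPredecessor([2, 5, 7, 11, 13, 17, 19, 23], step=3)
```

The block-restricted binary search touches at most `step` elements, so with `step` set to $\log^2 n$ it costs $O(\log\log n)$ comparisons.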

Query Answering: The query answering algorithm remains the same as that in our linear
index, except that decoding origin and term-frequency takes $O(\log\log D)$ time.
Then the time complexities for the steps in Lemma 3 are as follows:
Step (i) $O((\pi \log\log n + k) \log\log D)$, Step (ii) $O((\log\pi + k)(\log\pi + \log\log D))$ and
Step (iii) $O(((p/\pi) \log\log n + k) \log\log D)$. Since the term-frequencies are positive integers $< n$, we use a y-fast trie [31]
to get the sorted answer in $O(k \log\log n)$ time. By choosing $\pi = \log^2\log n$, the query time can
be bounded by $O(t_s(p) + p + \log^4\log n + k \log\log n)$, which gives the query time in Theorem
1. Here $t_s(p)$ is the time for the initial pattern search in CSA, and is $\Omega(p)$ for space-optimal
CSAs [ETC].

Space Analysis: The index consists of a full-text index of $|CSA|$ bits, $D_A$ of $n \log D (1 + o(1))$
bits, I-structures of $2n(\log D + O(\log\pi) + O(1))$ bits in total, tree encodings, RMQ
structures and sampled predecessor search structures (together $O(n)$ bits). By choosing
$\pi = \log^2\log n$, the index space can be bounded by $|CSA| + n \log D (3 + o(1)) + O(n \log\log\log n)$
bits. In order to obtain the space bounds in Theorem 1, we may categorize $D$ into the following
two cases.



7 A monotonically increasing sequence $S = 1333445$ can be encoded as $B = 101100010010$ in $|B|(1 + o(1))$ bits,
where $S[i] = rank_1(select_0(i))$ on $B$ can be computed in constant time [27].

8 Construct a new array by sampling every $\log^2 n$th element in the original array, and maintain a predecessor
search structure over it. When we perform the query, we first query this sampled structure to
get an approximate answer, and the exact answer can then be obtained by performing a binary search on a smaller
range of only $\log^2 n$ elements in the original array. The search time still remains $O(\log\log n)$.

9 Let GST* be the tree induced by the marked nodes in GST, so that $w^*$ is the lowest marked ancestor of $w^*_q$
in GST if and only if the node corresponding to $w^*$ in GST* (say, $w$) is the parent of the node corresponding to
$w^*_q$ (say, $w_q$) in GST*. Moreover, $w^*_q$ is said to be the $q$th marked child of node $w^*$ in GST if $w_q$ is the $q$th
child of $w$ in GST*. Given the pre-order rank of any marked node in GST, its pre-order rank in GST* (and
vice versa) can be computed in constant time by maintaining an additional bit vector of size $2n + o(n)$
which records whether each node is marked.



1. When $\log D / \log\log D > \log\log\log n$, the $O(n \log\log\log n)$ term can be absorbed in
$o(n \log D)$. The space can be further reduced by $n \log D$ bits using the observation
that the term-frequency is 1 for those I-structure entries whose origin is a leaf in GST,
and there are $n$ such entries. Therefore all such entries can be deleted, and if such a
document is within the top-k, it can be reported using document listing. For that we
use Muthukrishnan's chain array idea [21]. The chain array $C[1..n]$ is defined as follows:
$C[i] = j$, where $j < i$ is the largest number with $D_A[i] = D_A[j]$, and it can be simulated
using $D_A$ as $j = select(D_A[i], rank(D_A[i], i) - 1)$ in $O(\log\log D)$ time. Thus we do not
maintain the chain array itself; instead we maintain a $2n + o(n) = o(n \log D)$-bit RMQ structure [6] over it.
Let $[L, R]$ be the suffix range of $P$ in the full-text index; then document listing can be
performed (in $O(\log\log D)$ time per document) by reporting all those documents $D_A[i]$
such that $L \le i \le R$ and $C[i] < L$ using repeated RMQs. Although those documents
with frequency $> 1$ will get retrieved again (but only once), it will not affect the overall
time complexity.

2. When $\log D / \log\log D \le \log\log\log n$, we use the index described in Theorem 4. Thus
the space and query bounds are $|CSA| + n \log D (1 + o(1))$ bits and $O(t_s(p) + \log\log n +
k \log D \log^2\log D) = O(t_s(p) + k \log\log n)$, respectively.

By combining the above cases, we get the result in Theorem 1. □
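
The chain-array document listing used in case 1 can be sketched as follows. This is an illustration under simplifying assumptions: indices are 0-based, $C[i] = -1$ when $D_A[i]$ has no earlier occurrence, the chain array is stored explicitly (the index only simulates it through rank/select on $D_A$), and the range-minimum query is a naive scan in place of the $2n + o(n)$-bit RMQ structure. The function names are hypothetical.

```python
def chain_array(doc_array):
    """C[i] = largest j < i with doc_array[j] == doc_array[i], else -1."""
    last, chain = {}, []
    for i, d in enumerate(doc_array):
        chain.append(last.get(d, -1))
        last[d] = i
    return chain

def list_documents(doc_array, chain, left, right):
    """Report each distinct document in doc_array[left..right] exactly once."""
    out, stack = [], [(left, right)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        m = min(range(lo, hi + 1), key=chain.__getitem__)  # naive range minimum
        if chain[m] < left:            # first occurrence of this document
            out.append(doc_array[m])
            stack.append((lo, m - 1))
            stack.append((m + 1, hi))
        # otherwise no position in [lo, hi] is a first occurrence: prune
    return out

DA = [1, 2, 1, 3, 2, 1, 3]
C = chain_array(DA)
found = list_documents(DA, C, 2, 5)
```

Each reported document costs a constant number of range-minimum queries, and a subrange is abandoned as soon as its minimum chain value is at least `left`.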

Theorem 4 There exists an index of size $|CSA| + n \log D (1 + o(1))$ bits with a query time of
$O(t_s(p) + \log\log n + k \log D \log^2\log D)$ for retrieving top-k documents with the highest term
frequencies for a query pattern $P$ of length $p$.

Proof. See Appendix C.

6 Saving More Space

The most space-efficient version of our index (described in Theorem 2) is proved in this section.
First, we give the following auxiliary lemma (see Appendix D for the proof).

Lemma 4. There exists an $O(n \log\sigma \log\log n)$-bit structure which can answer access/rank/select
queries on $D_A$ in $O(\log^2\log n)$ time, and can compute an entry $C[i]$ in the chain-array data
structure (for document listing) in $O(\log\log n)$ time.

To achieve space reduction, we categorize $D$ into the following cases:

1. $\log D \le (\log\sigma \log\log n)^{1+\epsilon/2}$: We use the index described in Theorem 4, and the
query time is $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon})$.

2. $\log D > (\log\sigma \log\log n)^{1+\epsilon/2}$: In this case $D_A$ is replaced by the structure described in
Lemma 4, which makes the index space $n \log D (1 + o(1))$ bits. Then, by re-deriving the
bounds with $\pi = \log^3\log n$, our query time is $O(t_s(p) + \log^6\log n + k \log^2\log n)$.
The $O(k \log^2\log n)$ term can be further improved to $O(k \log\log n)$ using the
observation that, once we get the I-structure boundaries, we do not need any information
about the origin fields for further query processing. Thus the only value needed is the
term-frequency, which can be computed as follows: a sampled document array $D^S_A$ is
maintained, such that $D_A[i] = d$ is stored if and only if $rank_{D_A}(d, i) \bmod \rho = 0$, for
an integer $\rho = \Theta(\log D)$; otherwise we store a NIL value, where $rank_{D_A}(d, j)$ is the number
of occurrences of $d$ in $D_A[1..j]$. Then $D^S_A$ can be maintained in $O(n \log D / \rho) = O(n)$
bits and can compute an approximate rank. That is, $\rho \cdot rank_{D^S_A}(d, j) \le rank_{D_A}(d, j) <
\rho \cdot rank_{D^S_A}(d, j) + \rho$. Thus, associated with each I-structure entry, we store the error
(which is $O(\log D)$), equal to the actual term-frequency minus the approximate term-frequency
(computed using $D^S_A$). By storing this error for each I-structure entry
in $O(n \log\rho) = O(n \log\log D) = o(n \log D)$ bits of space in total, the term-frequency can be
obtained in $O(\log\log D) = O(\log\log n)$ time by first computing the approximate term-frequency
using $D^S_A$ and then adding this stored value. Note that for the initial
I-structure boundary searches, the origin decoding is performed using the structure in
Lemma 4. Moreover, this structure can compute chain array values in $O(\log\log n)$ time,
which can be used for document listing in $O(\log\log n)$ time per report (when the
I-structure entries with term-frequency $= 1$ are deleted from the index, and later such a
document is an answer for a query).

By combining the above cases, we obtain a $|CSA| + n \log D (1 + o(1))$-bit index with
query time $O(t_s(p) + k(\log\sigma \log\log n)^{1+\epsilon} + \log^6\log n)$, which completes the proof of Theorem
2. □
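
The approximate rank obtained from the sampled document array, and the per-entry error that restores the exact term-frequency, can be checked with a short sketch. It is illustrative only: indices are 0-based, ranks are computed by linear scans, `None` plays the role of NIL, and the function names and the choice $\rho = 3$ are assumptions made for the example.

```python
def exact_rank(doc_array, d, j):
    """Occurrences of d in doc_array[0:j]."""
    return doc_array[:j].count(d)

def build_sampled(doc_array, rho):
    """Keep an entry d only when its rank among the d's is a multiple of rho."""
    seen, sampled = {}, []
    for d in doc_array:
        seen[d] = seen.get(d, 0) + 1
        sampled.append(d if seen[d] % rho == 0 else None)
    return sampled

def approx_rank(sampled, d, j, rho):
    """rho times the rank of d in the sampled array; never exceeds exact_rank."""
    return rho * sampled[:j].count(d)

def approx_tf(sampled, d, i, j, rho):
    """Approximate term-frequency of d in positions i..j (inclusive)."""
    return approx_rank(sampled, d, j + 1, rho) - approx_rank(sampled, d, i, rho)

DA = [1, 2, 1, 1, 2, 1, 1, 1, 2, 1]
rho = 3
DS = build_sampled(DA, rho)
# Stored with the I-structure entry: exact minus approximate, |error| < rho.
error = (exact_rank(DA, 1, 8) - exact_rank(DA, 1, 2)) - approx_tf(DS, 1, 2, 7, rho)
```

Because each of the two approximate ranks is off by less than $\rho$, the stored error fits in $O(\log\rho)$ bits, and adding it to `approx_tf` gives back the exact count.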
A Proof of Lemma 1

In [13], Hon et al. described an $O(t + k \log k)$-time algorithm for retrieving the $k$ largest
numbers in sorted order. However, if sorted order is not necessary, the time can be
improved to $O(t + k)$ based on the following result of Frederickson [7]: the $k$th largest
number from a set of numbers maintained in a binary max heap $\mathcal{H}$ can be retrieved in $O(k)$
time by visiting $O(k)$ nodes in $\mathcal{H}$. In order to solve our problem, we may consider a conceptual
binary max heap $\mathcal{H}$ as follows: Let $\mathcal{H}'$ denote the balanced binary subtree with $t$ leaves that
is located at the top part of $\mathcal{H}$ (with the same root). Each of the $t - 1$ internal nodes in
$\mathcal{H}'$ holds the value $\infty$. The $i$th leaf node $\ell_i$ in $\mathcal{H}'$ (for $i = 1, 2, \ldots, t$) holds the value $A[M_i]$,
which is the maximum element in the interval $A[L_i..R_i]$. The values held by the nodes below
$\ell_i$ are defined recursively as follows: for a node $\ell$ storing the maximum element $A[M]$
from the range $A[L..R]$, its left child stores the maximum element in $A[L..(M - 1)]$ and its
right child stores the maximum element in $A[(M + 1)..R]$. Note that this is a conceptual heap
which is built on the fly, where the value associated with a node is computed in constant time
based on the RMQ structures only when needed. Therefore, we first find the $(t - 1 + k)$th
largest element $X$ in this heap by visiting $O(t + k)$ nodes (with $O(t + k)$ RMQ queries) using
Frederickson's algorithm. Then, we obtain all those numbers in $\mathcal{H}$ which are $\ge X$ in $O(t + k)$
time by a pre-order traversal of $\mathcal{H}$, such that if the value associated with a node is $< X$, we
do not check the nodes in its subtree. From those retrieved numbers, we delete all the $\infty$'s
and then separate out the $k$ largest elements in $O(t + k)$ time.
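
The conceptual heap can be explored lazily with an ordinary priority queue. The sketch below is a simplification and not the algorithm of the proof: it swaps Frederickson's selection for a binary heap, so it needs $O((t + k) \log(t + k))$ time instead of $O(t + k)$ and returns the answers in sorted order; the range-maximum query is a naive scan standing in for the RMQ structure. The function names are hypothetical.

```python
import heapq

def range_max(A, lo, hi):
    """Index of the maximum of A[lo..hi] (inclusive); naive RMQ stand-in."""
    return max(range(lo, hi + 1), key=A.__getitem__)

def top_k_in_ranges(A, ranges, k):
    """k largest values over the union of disjoint inclusive ranges."""
    heap = []

    def push(lo, hi):
        # A heap node represents a range and is keyed by the range maximum.
        if lo <= hi:
            m = range_max(A, lo, hi)
            heapq.heappush(heap, (-A[m], m, lo, hi))

    for lo, hi in ranges:
        push(lo, hi)
    out = []
    while heap and len(out) < k:
        neg, m, lo, hi = heapq.heappop(heap)
        out.append(-neg)
        push(lo, m - 1)    # left child: maximum of A[lo..m-1]
        push(m + 1, hi)    # right child: maximum of A[m+1..hi]
    return out

A = [5, 1, 9, 3, 7, 2, 8, 6]
best = top_k_in_ranges(A, [(0, 2), (4, 6)], 3)
```

Only ranges whose maximum has already been reported are split, so at most $t + 2k$ nodes of the conceptual heap are ever materialized.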

B Proof of Lemma 2 

In order to answer the above query, we maintain $A$ in the form of a wavelet tree [11], which
is an ordered balanced binary tree of $\pi$ leaves, where each leaf is labeled with a symbol in $\Pi$,
and the leaves are sorted alphabetically from left to right. Each internal node $w_q$ represents
an alphabet set $\Pi_q$, and is associated with a bit-vector $B_q$. In particular, the alphabet set of
the root is $\Pi$, and the alphabet set of a leaf is the singleton set containing its corresponding
symbol. Each node partitions its alphabet set among the two children (almost) equally, such
that all symbols represented by the left child are lexicographically (or numerically) smaller
than those represented by the right child.

For a node $w_q$, let $A_q$ be the subsequence of $A$ obtained by retaining only those symbols that
are in $\Pi_q$. Then $B_q$ is a bit-vector of length $|A_q|$, such that $B_q[i] = 0$ if $A_q[i]$ is a symbol
represented by the left child of $w_q$, else $B_q[i] = 1$. Indeed, the subtree rooted at $w_q$ itself
forms a wavelet tree of $A_q$. To reduce the space requirement, the array $A$ is not stored
explicitly in the wavelet tree. Instead, we only store the bit-vectors $B_q$, each of which is
augmented with Raman et al.'s scheme [27] to support constant-time rank/select operations. The
total size of the bit-vectors and the augmented structures in a particular level of the wavelet
tree is $n(1 + o(1))$ bits. We maintain an additional range maximum query (RMQ) [6] structure
over the scores of all elements of each sequence $A_q$ (in $O(|A_q|)$ bits). As there are $\log\pi$
levels in the wavelet tree, the total space is $O(n \log\pi)$ bits. Note that the value of $A_q[i]$
for any given $w_q$ and $i$ can be computed in $O(\log\pi)$ time by traversing $\log\pi$ levels in the
wavelet tree. Similarly, any given range $[x'..x'']$ can be translated to $w_q$ as $[x'_q..x''_q]$ in
$O(\log\pi)$ time, where $A_q[x'_q..x''_q]$ is the subsequence of $A[x'..x'']$ with only those elements
in $\Pi_q$.

The desired $k$ highest scoring entries can be retrieved as follows. First, the given range
$[y', y'']$ can be split into at most $2\log\pi$ disjoint subranges, such that each subrange is
represented by the alphabet set $\Pi_q$ associated with some internal node $w_q$. All the numbers in
the subsequence $A_q$ associated with such an internal node $w_q$ satisfy the condition
$y' \le A_q[i] \le y''$. For all such (at most $2\log\pi$) sequences $A_q$, the range $[x', x'']$ can be
translated into the corresponding range $[x'_q, x''_q]$ in $O(\log^2\pi)$ time. Now, we can apply
Lemma 1 (where $t \le 2\log\pi$) to solve the desired query. However, retrieving a node value in the
conceptual max heap (in the proof of Lemma 1) requires us to compute the score of $A_q[i]$ for
some $w_q$ and $i$ on the fly. We do so by first finding the entry $A[i']$ that corresponds to
$A_q[i]$, and then retrieving the score of $A[i']$. This takes $O(\log\pi + t_{score})$ time, so that
the total query time will be bounded by

$O(\log^2\pi + (2\log\pi + k)(\log\pi + t_{score})) = O((\log\pi + k)(\log\pi + t_{score}))$.
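
The range-splitting step can be sketched in Python as follows. This is a minimal illustration
under simplifying assumptions, not the succinct structure of the proof: symbols are integers
from an interval, plain prefix sums stand in for the constant-time rank structure, and the
subsequences are only used during construction. The names `WaveletNode` and `canonical_ranges`
are hypothetical, and position ranges are half-open.

```python
class WaveletNode:
    def __init__(self, seq, lo, hi):
        # seq is the subsequence A_q; [lo, hi] is the alphabet interval Pi_q.
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi:
            return  # leaf: a single symbol, no bit-vector needed
        mid = (lo + hi) // 2
        # Bit-vector B_q: 0 if the symbol goes to the left child, else 1.
        bits = [0 if s <= mid else 1 for s in seq]
        # zeros[i] = number of 0-bits in bits[0..i-1] (prefix sums stand in for rank).
        self.zeros = [0]
        for b in bits:
            self.zeros.append(self.zeros[-1] + (1 - b))
        self.left = WaveletNode([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletNode([s for s in seq if s > mid], mid + 1, hi)


def canonical_ranges(node, x1, x2, y1, y2, out):
    """Split the query (positions [x1, x2), symbols [y1, y2]) over maximal nodes.

    Appends (lo, hi, x1_q, x2_q) for every maximal node whose alphabet interval
    [lo, hi] lies inside [y1, y2]; [x1_q, x2_q) is the translated position range.
    """
    if x1 >= x2 or node.hi < y1 or node.lo > y2:
        return
    if y1 <= node.lo and node.hi <= y2:
        out.append((node.lo, node.hi, x1, x2))
        return
    z1, z2 = node.zeros[x1], node.zeros[x2]
    canonical_ranges(node.left, z1, z2, y1, y2, out)             # positions of 0-bits
    canonical_ranges(node.right, x1 - z1, x2 - z2, y1, y2, out)  # positions of 1-bits
```

Each reported node then contributes one range to the procedure of Lemma 1, with an RMQ over the
scores of its subsequence.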

C Proof of Theorem 4 

A simple index can be derived based on the succinct framework proposed by Hon et al. [H] and
Gagie et al. [8], which consists of the compressed version of GST (CSA and tree encoding)
and the document array $D_A$ (of $n \log D + O(\frac{n \log D}{\log\log D})$ bits of space, with
rank/select/access capabilities in $O(\log\log D)$ time for any $d \in \mathcal{D}$ [9]). Also, for a
particular value $q$ to be defined, we group every $g = q \log D \log\log D$ leaves in the GST
together (from left to right) and mark the lowest common ancestor (LCA) of all these leaves.
Further, we mark the LCA of all pairs of marked nodes. Thus the number of marked nodes in GST
can be bounded by $O(n/g)$ [13]. For each marked node, we maintain the top-$q$ documents in its
subtree explicitly, which takes $O(n/g \times q \log D) = O(n/\log\log D)$ bits. We perform this
marking and store the top-$q$ answers for $q = 1, 2, 4, \ldots$, which takes
$O(\frac{n \log D}{\log\log D})$ bits of storage space. The total index space can thus be bounded by
$|CSA| + O(n) + n \log D + O(\frac{n \log D}{\log\log D}) = |CSA| + n \log D(1 + o(1))$
bits (assuming $D \ge \sqrt{\log\log n}$).
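
The space accounting for the stored answers can be written out as below. The derivation assumes
that $q$ ranges over the powers of two up to $D$ (so there are $\lfloor\log D\rfloor + 1$ values
of $q$); this upper limit is an assumption of the sketch rather than something stated explicitly
above.

```latex
\begin{align*}
\text{for one value of } q:\quad
  & O\!\left(\frac{n}{g}\cdot q\log D\right)
    = O\!\left(\frac{n\,q\log D}{q\log D\,\log\log D}\right)
    = O\!\left(\frac{n}{\log\log D}\right),\\
\text{over all } q \in \{1, 2, 4, \ldots\},\ q \le D:\quad
  & (\lfloor\log D\rfloor + 1)\cdot O\!\left(\frac{n}{\log\log D}\right)
    = O\!\left(\frac{n\log D}{\log\log D}\right)
    = o(n\log D).
\end{align*}
```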

In order to retrieve the top-$k$ answers corresponding to a suffix range, we first search for
$P$ in GST and obtain its locus node $locus(P)$ in $O(p)$ time. Further, we round the value of $k$
up to the next power of 2, say $q$. Now we search for a marked node $locus^*(P)$ (corresponding
to this $q$), which is the same as $locus(P)$ if $locus(P)$ is marked, else it is the highest marked
descendant of $locus(P)$. Let $[L, R]$ be the suffix range of $locus(P)$ and $[L^*, R^*]$ be the suffix
range of $locus^*(P)$; then the leaves corresponding to the ranges $[L, L^* - 1]$ and $[R^* + 1, R]$
are called fringe leaves. It is easy to show that the number of fringe leaves is at most $2g$
(see [II]). Hence, in order to retrieve the top-$k$ answers, we first check the top-$q$ answers
stored at $locus^*(P)$ (and compute their scores), and then retrieve the score of each of the at
most $2g$ documents corresponding to the fringe leaves. Recall that the score of a document $d$ is
the frequency of $P$ in $d$, which can be computed in $O(\log\log D)$ time. Thus, the total time can
be bounded by $O(t_s(p) + (g + k) \log\log D) = O(t_s(p) + k \log D \log^2\log D)$.

Next, we find the top-$k$ answers from this candidate set of $2g + q < 2g + 2k$ documents.
As there may be repetitions in the set, we first remove the repetitions by scanning the set
once (using an auxiliary bit vector of length $D$ to mark if we have already seen a document).
After that, we find the document $d$ which has the $k$th highest frequency using $O(k + g) =
O(k \log D \log\log D)$ time [3]. Finally, we isolate the top-$k$ answers in unsorted order based
on the score of $d$, and sort them in $O(k \log k) = O(k \log D)$ time. If $D < \sqrt{\log\log n}$, we can
retrieve the term-frequency of all documents in $\mathcal{D}$ and trivially find the top-$k$ documents in
$O(D \log D) = O(\log\log n)$ time. Putting it all together, the overall query time can be bounded
by $O(t_s(p) + \log\log n + k \log D \log^2\log D)$.
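
The candidate-collection and filtering steps can be sketched in Python as follows. This is an
illustration under assumptions, not the index itself: the document array is a plain list,
`score` is any callable returning the term frequency of a document, and `heapq.nlargest`
replaces the linear-time selection of [3] followed by sorting. The function names are
hypothetical.

```python
import heapq


def candidate_documents(doc_array, L, R, L_star, R_star, stored_top_q):
    """Candidates = answers stored at the marked node + documents at the fringe leaves.

    [L, R] is the suffix range of the locus and [L_star, R_star] that of the
    marked node; the fringe leaves are [L, L_star - 1] and [R_star + 1, R].
    """
    fringe = list(range(L, L_star)) + list(range(R_star + 1, R + 1))
    return list(stored_top_q) + [doc_array[i] for i in fringe]


def top_k_from_candidates(candidates, score, k, D):
    """Remove repeated documents with a length-D bit vector, then keep the k best."""
    seen = [False] * (D + 1)  # documents are assumed to be numbered 1..D
    unique = []
    for d in candidates:
        if not seen[d]:
            seen[d] = True
            unique.append(d)
    # nlargest returns the k highest-scoring documents in decreasing score order.
    return heapq.nlargest(k, unique, key=score)
```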

D Proof of Lemma 4 

Let CSA be the compressed suffix array corresponding to the suffix array associated with
GST. Let $t_{SA}$ and $t_{SA^{-1}}$ denote the time for computing $SA[i]$ (the starting position of the
$i$th smallest suffix of $T$) and the time for computing $SA^{-1}[j]$ (the rank of the $j$th suffix
$T[j..n]$ among all suffixes of $T$), respectively. Hon et al. [33] showed that the above
operations on $D_A$ can be simulated by an index of size $2|CSA| + o(n)$, and the best query time
complexities are due to Belazzougui and Navarro [2]. We conclude the results in the following
lemma. We maintain CSA corresponding to GST and the compressed suffix arrays $CSA_d$ (for
$d = 1, 2, 3, \ldots, D$) corresponding to each individual document. Now, $access(i)$ can be obtained
by returning $SA[i]$ in CSA. For $select(d, j)$, we first compute the $j$th smallest suffix in
$CSA_d$, and obtain the position $pos$ of this suffix within document $d$, based on which we can
easily obtain the position $pos'$ of this suffix within the concatenated text of all documents.
After that, we compute $SA^{-1}[pos']$ in CSA as the desired answer for $select(d, j)$. By doing a
binary search on $select$, $rank(d, i)$ can be obtained in $O((t_{SA} + t_{SA^{-1}}) \log n)$ time. This time
can be improved to $O((t_{SA} + t_{SA^{-1}}) \log\log n)$ as follows: at every $\log^2 n$th leaf of each
$CSA_d$, we explicitly maintain its corresponding position in CSA and maintain a predecessor
structure over it [31]. The size of this additional structure is $o(n)$ bits. Now, when we
perform the query, we can first query this predecessor structure to get an approximate
answer, and the exact answer can be obtained by performing binary search on a smaller range of
only $\log^2 n$ leaves. By choosing the $O(n \log\sigma \log\log n)$-bit CSA by Grossi and Vitter [10],
where $t_{SA}$ and $t_{SA^{-1}}$ take $O(\log\log n)$ time, we obtain the lemma.
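
The simulation of access, select and rank can be illustrated with the Python sketch below. It
uses plain, naively built suffix arrays in place of compressed ones, so it shows only the logic
and not the space or time bounds, and it uses the simple binary search for rank without the
sampled predecessor structure. It assumes every document ends with a separator character that
is smaller than all document characters; the class `DocArraySim` and its method names are
hypothetical, and documents and ranks are 0-based while `j` in `select` is 1-based.

```python
import bisect


def suffix_array(text):
    """Naive suffix array by sorting suffixes; a stand-in for a compressed suffix array."""
    return sorted(range(len(text)), key=lambda i: text[i:])


class DocArraySim:
    def __init__(self, docs, sep="$"):
        # Assumption: sep is smaller than every character occurring in the documents.
        self.texts = [d + sep for d in docs]
        self.offsets = []  # starting position of each document in the concatenation
        pos = 0
        for t in self.texts:
            self.offsets.append(pos)
            pos += len(t)
        T = "".join(self.texts)
        self.sa = suffix_array(T)                            # SA of the concatenated text
        self.isa = [0] * len(T)                              # SA^{-1}
        for r, p in enumerate(self.sa):
            self.isa[p] = r
        self.doc_sa = [suffix_array(t) for t in self.texts]  # SA_d per document

    def access(self, i):
        """Document containing the suffix of global rank i."""
        return bisect.bisect_right(self.offsets, self.sa[i]) - 1

    def select(self, d, j):
        """Global rank of the j-th smallest suffix of document d (j is 1-based)."""
        pos = self.doc_sa[d][j - 1]               # position within document d
        return self.isa[self.offsets[d] + pos]    # SA^{-1}[pos'] in the concatenation

    def rank(self, d, i):
        """Number of suffixes of document d with global rank <= i (binary search on select)."""
        lo, hi = 0, len(self.texts[d])
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.select(d, mid) <= i:
                lo = mid
            else:
                hi = mid - 1
        return lo
```

The sketch relies on the fact that, with such a separator, the suffixes of one document appear
in the same relative order in the global suffix array as in the document's own suffix array.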

An entry of the chain array is $C[i] = j$ if $j < i$ is the largest number with $D_A[i] = D_A[j]$
($= d$, say), and is NIL if there is no such $j$. We use the following steps to compute $j$: using
$SA[i]$, compute the starting position of the lexicographically $i$th smallest suffix of the
concatenated text and the corresponding value $d$. Let this be the lexicographically $i_d$th
smallest suffix of $d$; then the $(i_d - 1)$th smallest suffix of $d$ can be computed using $SA_d$
and $SA_d^{-1}$ operations. Further, we map this text position in $d$ back to the concatenated text
and perform an $SA^{-1}$ operation on it to obtain $j$. The total time can be bounded by
$O(\log\log n)$.
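
The chain-array computation can be sketched in Python along the same lines. Again this is only
an illustration with plain suffix arrays, assuming a separator character smaller than all
document characters; `chain_array` is a hypothetical name, indices are 0-based, and `None`
plays the role of NIL.

```python
import bisect


def suffix_array(text):
    """Naive suffix array by sorting suffixes; a stand-in for a compressed suffix array."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse(perm):
    """Inverse permutation, i.e. SA^{-1} for a suffix array SA."""
    inv = [0] * len(perm)
    for r, p in enumerate(perm):
        inv[p] = r
    return inv


def chain_array(docs, sep="$"):
    texts = [d + sep for d in docs]
    offsets, pos = [], 0
    for t in texts:
        offsets.append(pos)
        pos += len(t)
    sa = suffix_array("".join(texts))        # SA of the concatenated text
    isa = inverse(sa)                        # SA^{-1}
    doc_sa = [suffix_array(t) for t in texts]
    doc_isa = [inverse(s) for s in doc_sa]

    C = []
    for i in range(len(sa)):
        d = bisect.bisect_right(offsets, sa[i]) - 1   # document of the i-th smallest suffix
        i_d = doc_isa[d][sa[i] - offsets[d]]          # its rank among the suffixes of d
        if i_d == 0:
            C.append(None)                            # no earlier suffix of d: NIL
        else:
            prev_pos = doc_sa[d][i_d - 1]             # previous suffix of d, position in d
            C.append(isa[offsets[d] + prev_pos])      # map back and apply SA^{-1}
    return C
```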



